# Multimodal retrieval

**Jina Embeddings V4** (jinaai; Multimodal Fusion, Transformers, Other; 669 downloads, 36 likes): A general-purpose embedding model designed for multimodal and multilingual retrieval, particularly suited to complex, visually rich documents containing charts, tables, and illustrations.
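
As a sketch of how such a model is typically driven, the snippet below loads jina-embeddings-v4 with Hugging Face transformers and ranks page images against a text query. The `encode_text`/`encode_image` method names and the `task="retrieval"` argument are assumptions based on the conventions of earlier Jina model cards, not a confirmed API; consult the jinaai/jina-embeddings-v4 card before relying on them.

```python
# A minimal sketch, assuming the jinaai/jina-embeddings-v4 remote code exposes
# encode_text/encode_image with a task argument (the pattern of earlier Jina
# releases); the real method names may differ, so check the model card.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v4",
    trust_remote_code=True,  # the model ships its own encoding code
)

# Embed a text query and two page images into the same vector space.
query_emb = model.encode_text(texts=["quarterly revenue chart"], task="retrieval")
page_embs = model.encode_image(
    images=["report_page_1.png", "report_page_2.png"],  # placeholder files
    task="retrieval",
)

# Rank pages by cosine similarity to the query (higher = better match).
scores = torch.nn.functional.cosine_similarity(
    torch.as_tensor(query_emb), torch.as_tensor(page_embs)
)
print(scores)
```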
**CLIP ViT H 14 Laion2b S32b B79k** (ModelsLab; MIT; Text-to-Image; 132 downloads, 0 likes): A vision-language model built on the OpenCLIP framework and trained on the LAION-2B English subset, excelling at zero-shot image classification and cross-modal retrieval.

**CLIP ViT B 32 Laion2b S34b B79k** (recallapp; MIT; Text-to-Image; 17 downloads, 0 likes): A vision-language model trained on the LAION-2B English dataset with the OpenCLIP framework, supporting zero-shot image classification and cross-modal retrieval.
**Colpali V1.1** (vidore; MIT; Text-to-Image, Safetensors, English; 196 downloads, 2 likes): ColPali is a visual retrieval model that combines PaliGemma-3B with the ColBERT late-interaction strategy to index documents efficiently from their visual features.
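
The ColBERT strategy mentioned here is a late-interaction scheme: instead of one vector per document, ColPali keeps one embedding per image patch and one per query token, and scores a page by summing each query token's best match. The sketch below reimplements that scoring in plain PyTorch with random stand-in tensors; a real pipeline would obtain the embeddings from the vidore/colpali-v1.1 checkpoint (for example via the colpali-engine library).

```python
# A minimal sketch of ColBERT-style "late interaction" scoring over visual
# features: match every query-token embedding against every document-patch
# embedding, then sum the per-token maxima (MaxSim).
import torch

def late_interaction_score(query: torch.Tensor, doc: torch.Tensor) -> torch.Tensor:
    """query: (n_query_tokens, dim); doc: (n_patches, dim); both L2-normalized."""
    sim = query @ doc.T                  # (n_query_tokens, n_patches) cosine similarities
    return sim.max(dim=1).values.sum()   # best patch per query token, summed

torch.manual_seed(0)
query = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)    # 12 query tokens
page_a = torch.nn.functional.normalize(torch.randn(1024, 128), dim=-1) # 1024 image patches
page_b = torch.nn.functional.normalize(torch.randn(1024, 128), dim=-1)

# Rank the two pages against the query; the higher score wins.
print(late_interaction_score(query, page_a), late_interaction_score(query, page_b))
```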
**Patentclip RN101** (hhshomee; MIT; Image Classification; 15 downloads, 0 likes): A zero-shot image classification model built on the OpenCLIP library, suited to patent image analysis.
**CLIP ViT B 32 Laion2b S34b B79k** (rroset; MIT; Text-to-Image; 48 downloads, 0 likes): A CLIP ViT-B/32 model trained on the LAION-2B dataset with the OpenCLIP framework, supporting zero-shot image classification and cross-modal retrieval.

**CLIP ViT B 32 DataComp.XL S13b B90k** (laion; MIT; Text-to-Image; 12.12k downloads, 4 likes): A CLIP ViT-B/32 model trained on the DataComp-1B dataset, designed for zero-shot image classification and image-text retrieval.

**CLIP ViT B 32 256x256 DataComp S34b B86k** (laion; MIT; Text-to-Image; 4,332 downloads, 8 likes): A CLIP ViT-B/32 model trained at 256x256 resolution on the DataComp-1B dataset with the OpenCLIP framework, used mainly for zero-shot image classification and image-text retrieval.
**CLIP ViT B 16 DataComp.XL S13b B90k** (flavour; MIT; Image-to-Text; 39.22k downloads, 1 like): A CLIP ViT-B/16 model trained on the DataComp-1B dataset, supporting zero-shot image classification and image-text retrieval.
**CLIP ViT L 14 DataComp.XL S13b B90k** (laion; MIT; Text-to-Image; 586.75k downloads, 113 likes): A CLIP ViT-L/14 model trained on the DataComp-1B dataset, used primarily for zero-shot image classification and image-text retrieval.
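
Since this checkpoint (like most of the LAION entries below) is published in OpenCLIP format, cross-modal retrieval reduces to embedding both modalities and comparing cosine similarities. A minimal sketch, assuming the open_clip_torch and Pillow packages and placeholder image files:

```python
# Image-text retrieval with the ViT-L/14 DataComp model, loaded from the
# Hugging Face Hub through OpenCLIP. File names are placeholders.
import torch
import open_clip
from PIL import Image

HUB_ID = "hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"
model, _, preprocess = open_clip.create_model_and_transforms(HUB_ID)
tokenizer = open_clip.get_tokenizer(HUB_ID)
model.eval()

images = torch.stack([
    preprocess(Image.open(p)) for p in ["cat.jpg", "skyline.jpg"]  # placeholders
])
texts = tokenizer(["a photo of a cat", "a city skyline at night"])

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(texts)
    img_feats /= img_feats.norm(dim=-1, keepdim=True)
    txt_feats /= txt_feats.norm(dim=-1, keepdim=True)
    sims = txt_feats @ img_feats.T  # text-to-image cosine similarity matrix

print(sims)  # each row: how well one caption matches each image
```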
**CLIP Convnext Xxlarge Laion2b S34b B82k Augreg Soup** (laion; MIT; Text-to-Image; 9,412 downloads, 22 likes): A CLIP ConvNeXt-XXLarge model trained on the LAION-2B dataset with the OpenCLIP framework, and the first CLIP model with a non-ViT image tower to exceed 79% ImageNet top-1 zero-shot accuracy.

**CLIP Convnext Large D 320.laion2B S29b B131k Ft** (laion; MIT; Text-to-Image, TensorBoard; 3,810 downloads, 3 likes): A CLIP model with a ConvNeXt-Large image tower, trained on the LAION-2B dataset, supporting zero-shot image classification and image-text retrieval.

**CLIP Convnext Large D 320.laion2B S29b B131k Ft Soup** (laion; MIT; Text-to-Image, TensorBoard; 83.56k downloads, 19 likes): A CLIP model with a ConvNeXt-Large image tower, trained on the LAION-2B dataset, supporting zero-shot image classification and image-text retrieval.

**CLIP Convnext Large D.laion2b S26b B102k Augreg** (laion; MIT; Text-to-Image, TensorBoard; 80.74k downloads, 5 likes): A large-scale ConvNeXt-Large CLIP model trained on the LAION-2B dataset, supporting zero-shot image classification and image-text retrieval.

**CLIP Convnext Base W 320 Laion Aesthetic S13b B82k** (laion; MIT; Text-to-Image, TensorBoard; 12.67k downloads, 3 likes): A CLIP model with a ConvNeXt-Base image tower, trained on a subset of LAION-5B, suited to zero-shot image classification and image-text retrieval.

**CLIP Convnext Base W Laion Aesthetic S13b B82k** (laion; MIT; Text-to-Image, TensorBoard; 703 downloads, 5 likes): A CLIP model with a ConvNeXt-Base image tower, trained on the LAION-Aesthetic dataset, supporting zero-shot image classification and cross-modal retrieval.

**CLIP Convnext Base W Laion2b S13b B82k** (laion; MIT; Text-to-Image; 4,522 downloads, 5 likes): A CLIP model with a ConvNeXt-Base image tower, trained on a subset of LAION-5B, supporting zero-shot image classification and image-text retrieval.
**CLIP ViT B 16 Laion2b S34b B88k** (laion; MIT; Text-to-Image; 251.02k downloads, 33 likes): A multimodal vision-language model trained with the OpenCLIP framework on the LAION-2B English dataset, supporting zero-shot image classification.
**CLIP ViT B 32 Laion2b S34b B79k** (laion; MIT; Text-to-Image; 1.1M downloads, 112 likes): A vision-language model trained on the English subset of LAION-2B with the OpenCLIP framework, supporting zero-shot image classification and cross-modal retrieval.
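
To close, here is the standard OpenCLIP zero-shot classification recipe applied to the most-downloaded model in this list. The image path and label set are placeholders; everything else follows the usage documented in the OpenCLIP README.

```python
# Zero-shot image classification with laion/CLIP-ViT-B-32-laion2B-s34B-b79K.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder file
labels = ["a diagram", "a dog", "a cat"]
text = tokenizer([f"a photo of {label}" for label in labels])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scaled cosine similarities, softmaxed into per-label probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```

Prompt templates such as "a photo of a ..." generally score better than bare class names, which is why the labels are wrapped before tokenization.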